Learning Goals

At the end of this exercise, you will be able to:
1. Practice applying the syntax of building plots using ggplot2.
2. Build a boxplot using ggplot2.
3. Customize labels on axes using labs and themes.
4. Use color, fill, and group to customize plots and improve overall aesthetics.

Where have we been, and where are we going?

At this point, we’ve plotted a few things and have a high-level understanding of how R can read in and summarize data. It is OK if you need to go back through past work and find bits of code that work for you, but try and force yourself to originate new chunks.

Libraries

library(tidyverse)
library(naniar)
library(janitor)
library(here)
library(skimr)
library(janitor)
library(here)
library(palmerpenguins)

Review

Now that you have been introduced to ggplot, let’s practice a few more plot types. Remember that plots are built in layers: plot= data + geom_ + aesthetics. We have to specify each of these in order for a plot to be produced. If you get stuck, it is often helpful to stop and make a quick sketch of what you want or expect to see on a piece of scratch paper.

Let’s review using the penguins data. First, get an idea of the structure: Are there NA’s? Are the variables discrete, categorical, or continuous?

penguins
## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge…           39.1          18.7              181        3750
##  2 Adelie  Torge…           39.5          17.4              186        3800
##  3 Adelie  Torge…           40.3          18                195        3250
##  4 Adelie  Torge…           NA            NA                 NA          NA
##  5 Adelie  Torge…           36.7          19.3              193        3450
##  6 Adelie  Torge…           39.3          20.6              190        3650
##  7 Adelie  Torge…           38.9          17.8              181        3625
##  8 Adelie  Torge…           39.2          19.6              195        4675
##  9 Adelie  Torge…           34.1          18.1              193        3475
## 10 Adelie  Torge…           42            20.2              190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

To start, let’s determine how many penguins are on an island in this data.

penguins %>% count(island, species, sort = F)
## # A tibble: 5 x 3
##   island    species       n
##   <fct>     <fct>     <int>
## 1 Biscoe    Adelie       44
## 2 Biscoe    Gentoo      124
## 3 Dream     Adelie       56
## 4 Dream     Chinstrap    68
## 5 Torgersen Adelie       52

What if we wanted a plot that showed the number of measured penguins per species and the number of measured penguins per island?

penguins %>% 
  group_by(island) %>% 
  ggplot(aes(x=species))+
  geom_bar()

How about average bill length by sex?

penguins %>% 
  filter(sex!="NA") %>% 
  group_by(sex) %>% 
  summarise(ave_bill_length=mean(bill_length_mm))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   sex    ave_bill_length
##   <fct>            <dbl>
## 1 female            42.1
## 2 male              45.9
penguins %>% 
  filter(sex!="NA") %>% 
  group_by(sex) %>% 
  summarise(ave_bill_length=mean(bill_length_mm)) %>% 
  ggplot(aes(x=sex, y=ave_bill_length)) +
  geom_col()
## `summarise()` ungrouping output (override with `.groups` argument)

Phew! So far so good. Now let’s take a more in-depth example

Database of vertebrate home range sizes.
Reference: Tamburello N, Cote IM, Dulvy NK (2015) Energy and the scaling of animal space use. The American Naturalist 186(2):196-211. http://dx.doi.org/10.1086/682070.
Data: http://datadryad.org/resource/doi:10.5061/dryad.q5j65/1

homerange <- read_csv("Tamburelloetal_HomeRangeDatabase.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   mean.mass.g = col_double(),
##   log10.mass = col_double(),
##   mean.hra.m2 = col_double(),
##   log10.hra = col_double(),
##   preymass = col_double(),
##   log10.preymass = col_double(),
##   PPMR = col_double()
## )
## See spec(...) for full column specifications.

Practice

  1. What is the structure of the homerange data? Does it have any NA’s? Do a quick exploratory analysis of your choice below.
glimpse(homerange)
## Rows: 569
## Columns: 24
## $ taxon                      <chr> "lake fishes", "river fishes", "river fish…
## $ common.name                <chr> "american eel", "blacktail redhorse", "cen…
## $ class                      <chr> "actinopterygii", "actinopterygii", "actin…
## $ order                      <chr> "anguilliformes", "cypriniformes", "cyprin…
## $ family                     <chr> "anguillidae", "catostomidae", "cyprinidae…
## $ genus                      <chr> "anguilla", "moxostoma", "campostoma", "cl…
## $ species                    <chr> "rostrata", "poecilura", "anomalum", "fund…
## $ primarymethod              <chr> "telemetry", "mark-recapture", "mark-recap…
## $ N                          <chr> "16", NA, "20", "26", "17", "5", "2", "2",…
## $ mean.mass.g                <dbl> 887.00, 562.00, 34.00, 4.00, 4.00, 3525.00…
## $ log10.mass                 <dbl> 2.9479236, 2.7497363, 1.5314789, 0.6020600…
## $ alternative.mass.reference <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ mean.hra.m2                <dbl> 282750.00, 282.10, 116.11, 125.50, 87.10, …
## $ log10.hra                  <dbl> 5.4514026, 2.4504031, 2.0648696, 2.0986437…
## $ hra.reference              <chr> "Minns, C. K. 1995. Allometry of home rang…
## $ realm                      <chr> "aquatic", "aquatic", "aquatic", "aquatic"…
## $ thermoregulation           <chr> "ectotherm", "ectotherm", "ectotherm", "ec…
## $ locomotion                 <chr> "swimming", "swimming", "swimming", "swimm…
## $ trophic.guild              <chr> "carnivore", "carnivore", "carnivore", "ca…
## $ dimension                  <chr> "3D", "2D", "2D", "2D", "2D", "2D", "2D", …
## $ preymass                   <dbl> NA, NA, NA, NA, NA, NA, 1.39, NA, NA, NA, …
## $ log10.preymass             <dbl> NA, NA, NA, NA, NA, NA, 0.1430148, NA, NA,…
## $ PPMR                       <dbl> NA, NA, NA, NA, NA, NA, 530, NA, NA, NA, N…
## $ prey.size.reference        <chr> NA, NA, NA, NA, NA, NA, "Brose U, et al. 2…
naniar::miss_var_summary(homerange)
## # A tibble: 24 x 3
##    variable                   n_miss pct_miss
##    <chr>                       <int>    <dbl>
##  1 alternative.mass.reference    561   98.6  
##  2 preymass                      502   88.2  
##  3 log10.preymass                502   88.2  
##  4 PPMR                          502   88.2  
##  5 prey.size.reference           502   88.2  
##  6 N                             375   65.9  
##  7 primarymethod                   1    0.176
##  8 taxon                           0    0    
##  9 common.name                     0    0    
## 10 class                           0    0    
## # … with 14 more rows

1. Scatter Plot: Review

Scatter plots are good at revealing relationships that are not readily visible in the raw data. For now, we will not add regression lines or calculate any r2 values.

In the case below, let’s explore whether or not there is a relationship between animal mass and home range. We are using the log transformed values because there is a large difference in mass and home range among the different species in the data.

As we go through each of these steps, quiz yourself on what you remember.

Let’s first grab our variables. What’s a command we can use to know what they are called?

names(homerange)
##  [1] "taxon"                      "common.name"               
##  [3] "class"                      "order"                     
##  [5] "family"                     "genus"                     
##  [7] "species"                    "primarymethod"             
##  [9] "N"                          "mean.mass.g"               
## [11] "log10.mass"                 "alternative.mass.reference"
## [13] "mean.hra.m2"                "log10.hra"                 
## [15] "hra.reference"              "realm"                     
## [17] "thermoregulation"           "locomotion"                
## [19] "trophic.guild"              "dimension"                 
## [21] "preymass"                   "log10.preymass"            
## [23] "PPMR"                       "prey.size.reference"

Remember that big datasets have data that may overlap. We used geom_jitter() before to add some random noise to the data and separating some of the individual points. Let’s write a plot for this.

ggplot(data = homerange, mapping = aes(x = log10.mass, y = log10.hra)) +
  geom_jitter()

And let’s add a best fit line. What did we change that added the line?

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
  geom_point()+
  geom_smooth(method=lm, se=T) #adds the regression line, `se=TRUE` will add standard error
## `geom_smooth()` using formula 'y ~ x'

2. Bar Plot: Review and expand

Let’s take a look at the variable trophic.guild. We’ll repeat some of the steps from before in barplotting, and then we’ll practice plotting new things!

We will first specify an x-axis and a y-axis.

homerange %>% 
  filter(family=="salmonidae") %>% 
  ggplot(aes(x=common.name, y=log10.mass))+
  geom_col()

Take a look at the graph. What’s that line that says ‘filter’ probably doing?

Let’s plot another one. Take a look at the graph and trace back what the code was asking.

homerange %>% 
  group_by(class) %>% 
  summarize(mean_body_wt=mean(log10.mass)) %>% 
  ggplot(aes(x=class, y=mean_body_wt))+
  geom_col()
## `summarise()` ungrouping output (override with `.groups` argument)

Checking in. How would we…?

  1. Filter the homerange data to include mammals only.
homerange %>% 
  filter(class=="mammalia")
## # A tibble: 238 x 24
##    taxon common.name class order family genus species primarymethod N    
##    <chr> <chr>       <chr> <chr> <chr>  <chr> <chr>   <chr>         <chr>
##  1 mamm… giant gold… mamm… afro… chrys… chry… trevel… telemetry*    <NA> 
##  2 mamm… Grant's go… mamm… afro… chrys… erem… granti  telemetry*    <NA> 
##  3 mamm… pronghorn   mamm… arti… antil… anti… americ… telemetry*    <NA> 
##  4 mamm… impala      mamm… arti… bovid… aepy… melamp… telemetry*    <NA> 
##  5 mamm… hartebeest  mamm… arti… bovid… alce… busela… telemetry*    <NA> 
##  6 mamm… barbary sh… mamm… arti… bovid… ammo… lervia  telemetry*    <NA> 
##  7 mamm… American b… mamm… arti… bovid… bison bison   telemetry*    <NA> 
##  8 mamm… European b… mamm… arti… bovid… bison bonasus telemetry*    <NA> 
##  9 mamm… goat        mamm… arti… bovid… capra hircus  telemetry*    <NA> 
## 10 mamm… Spanish ib… mamm… arti… bovid… capra pyrena… telemetry*    <NA> 
## # … with 228 more rows, and 15 more variables: mean.mass.g <dbl>,
## #   log10.mass <dbl>, alternative.mass.reference <chr>, mean.hra.m2 <dbl>,
## #   log10.hra <dbl>, hra.reference <chr>, realm <chr>, thermoregulation <chr>,
## #   locomotion <chr>, trophic.guild <chr>, dimension <chr>, preymass <dbl>,
## #   log10.preymass <dbl>, PPMR <dbl>, prey.size.reference <chr>
  1. Make a bar plot that shows the relative numbers of herbivores or carnivores in our dataset (e.g. Are there more herbivores or carnivores in mammals?)
homerange %>% 
  filter(class=="mammalia") %>% 
  group_by(trophic.guild) %>% 
  count()
## # A tibble: 2 x 2
## # Groups:   trophic.guild [2]
##   trophic.guild     n
##   <chr>         <int>
## 1 carnivore        80
## 2 herbivore       158
homerange %>% 
  filter(class=="mammalia") %>% 
  group_by(trophic.guild) %>% 
  count() %>% 
  ggplot(aes(y=trophic.guild, x=n))+
  geom_col()

Box Plots

Boxplots help us visualize a range of values. So, on the x-axis we typically have something categorical and the y-axis is the range. In the case below, we are plotting log10.mass by taxonomic class in the homerange data. geom_boxplot() is the geom type for a standard box plot. The center line in each box represents the median, not the mean.

Let’s look at the variable log10.mass grouped by taxonomic class.

homerange %>% 
  group_by(class) %>% 
  summarize(min_log10.mass=min(log10.mass),
            max_log10.mass=max(log10.mass),
            median_log10.mass=median(log10.mass))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 4
##   class          min_log10.mass max_log10.mass median_log10.mass
##   <chr>                   <dbl>          <dbl>             <dbl>
## 1 actinopterygii         -0.658           3.55              2.08
## 2 aves                    0.712           4.95              1.82
## 3 mammalia                0.620           6.60              3.33
## 4 reptilia                0.539           4.03              2.51
homerange %>% 
  ggplot(aes(x = class, y = log10.mass)) +
  geom_boxplot()

Practice

  1. There are more herbivores than carnivores in the homerange data, but how do their masses compare? Make a summary and boxplot that compares their masses. Use log10.mass.
homerange %>% 
  group_by(trophic.guild) %>% 
  summarize(min_log10.mass=min(log10.mass),
            max_log10.mass=max(log10.mass),
            mean_log10.mass=mean(log10.mass),
            median_log10.mass=median(log10.mass),
            total_n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 6
##   trophic.guild min_log10.mass max_log10.mass mean_log10.mass median_log10.ma…
##   <chr>                  <dbl>          <dbl>           <dbl>            <dbl>
## 1 carnivore             -0.658           5.05            2.24             2.28
## 2 herbivore              0.398           6.60            3.13             2.97
## # … with 1 more variable: total_n <int>
homerange %>% 
  ggplot(aes(x=trophic.guild, y=log10.mass))+
  geom_boxplot()

  1. Have a closer look at carnivorous mammals. Summarize and visualize the range of log10.mass by family.
homerange %>% 
  filter(taxon=="mammals" & trophic.guild=="carnivore") %>% 
  summarize(min_log10.mass=min(log10.mass),
            max_log10.mass=max(log10.mass),
            mean_log10.mass=mean(log10.mass),
            median_log10.mass=median(log10.mass),
            total_n=n()) %>% 
  pivot_longer(cols=everything(),
               names_to="measurement",
               values_to="value")
## # A tibble: 5 x 2
##   measurement        value
##   <chr>              <dbl>
## 1 min_log10.mass     0.620
## 2 max_log10.mass     5.05 
## 3 mean_log10.mass    3.03 
## 4 median_log10.mass  3.29 
## 5 total_n           80
  1. Now use a boxplot to visualize the range of body mass by family of mammalian carnivore.
homerange %>% 
  filter(taxon=="mammals" & trophic.guild=="carnivore") %>% 
  select(family, log10.mass) 
## # A tibble: 80 x 2
##    family          log10.mass
##    <chr>                <dbl>
##  1 chrysochloridae       2.64
##  2 chrysochloridae       1.36
##  3 canidae               3.70
##  4 canidae               4.44
##  5 canidae               3.87
##  6 canidae               3.60
##  7 canidae               3.65
##  8 canidae               3.51
##  9 canidae               3.32
## 10 eupleridae            3.98
## # … with 70 more rows
homerange %>% 
  filter(taxon=="mammals" & trophic.guild=="carnivore") %>% 
  select(family, log10.mass) %>% 
  ggplot(aes(x=family, y=log10.mass))+
  geom_boxplot()+
  coord_flip()+
  labs(title="Family vs. Log10.mass")

Aesthetics: Labels

Now that we have practiced scatter plots, bar plots, and box plots we need to learn how to adjust their appearance to suit our needs. Let’s start with labeling x and y axes.

Let’s revisit the plot of log10.mass and log10.hra.

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
  geom_point()+
  geom_smooth(method=lm, se=T)
## `geom_smooth()` using formula 'y ~ x'

The plot looks clean, but it is incomplete. A reader unfamiliar with the data might have a difficult time interpreting the labels. To add custom labels, we use the labs command.

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
  geom_point()+
  geom_smooth(method=lm, se=T)+
  labs(title="Log 10 Mass vs. Log 10 Home Range Area",
       x="Log 10 Mass (g)",
       y="Log 10 Home Range Area (m2)")
## `geom_smooth()` using formula 'y ~ x'

We can improve the plot further by adjusting the size and face of the text. We do this using theme().

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
  geom_point()+
  geom_smooth(method=lm, se=T)+
  labs(title="Log 10 Mass vs. Log 10 Home Range Area",
       x="Log 10 Mass (g)",
       y="Log 10 Home Range Area (m2)") +
  theme(plot.title = element_text(size = 12, face = "bold"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 10))
## `geom_smooth()` using formula 'y ~ x'

The rel() option changes the relative size of the title to keep things consistent. Adding hjust allows control of title position.

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
  geom_point()+
  geom_smooth(method=lm, se=T)+
  labs(title="Log 10 Mass vs. Log 10 Home Range Area",
       x="Log 10 Mass",
       y="Log 10 Home Range Area (m2)") +
  theme(plot.title = element_text(size = rel(1.5), hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'

Other Aesthetics

There are lots of options for aesthetics. An aesthetic can be assigned to either numeric or categorical data. fill is a common grouping option; notice that an appropriate key is displayed when you use one of these options.

  ggplot(data=homerange, aes(x=realm, fill=realm))+geom_bar()+
  labs(title = "# Species by Realm",
       x = "realm",
       y = NULL) +
  theme(plot.title = element_text(size = rel(1.5), hjust = 0.5))

size adjusts the size of points relative to a continuous variable.

  ggplot(data=homerange, aes(x=log10.hra, y=log10.mass, size=log10.hra))+
  geom_point(na.rm=T)

A few more useful aesthetics

There are many options to create nice plots in ggplot. One useful trick is to store the plot as a new object and then experiment with geom’s and aesthetics.

p <- homerange %>% 
  ggplot(aes(x= log10.mass, y= log10.hra))

Play with point size.

p +geom_point(size=1)

Map shapes to another categorical variable.

p+geom_point(aes(shape=thermoregulation, color=thermoregulation), size=1.75)

Barplots and multiple variables

At this point you should be comfortable building bar plots that shows counts of observations using geom_bar(). Last time we explored the fill option in box plots as a way to bring color to the plot; i.e. we filled by the same variable that we were plotting. What happens when we fill by a different categorical variable?

Let’s start by counting how many observations we have in each taxonomic group.

homerange %>% count(taxon)
## # A tibble: 9 x 2
##   taxon             n
##   <chr>         <int>
## 1 birds           140
## 2 lake fishes       9
## 3 lizards          11
## 4 mammals         238
## 5 marine fishes    90
## 6 river fishes     14
## 7 snakes           41
## 8 tortoises        12
## 9 turtles          14

Now let’s make a bar plot of these data.

homerange %>% 
  ggplot(aes(x = taxon)) + geom_bar() +
  coord_flip() +
  labs(title = "Observations by Taxon in Homerange Data",
       x = "Taxonomic Group")

By specifying fill=trophic.guild we build a stacked bar plot that shows the proportion of a given taxonomic group that is an herbivore or carnivore.

homerange %>% 
  ggplot(aes(x = taxon, fill = trophic.guild)) + geom_bar() +
  coord_flip() +
  labs(title = "Observations by Taxon in Homerange Data",
       x = "Taxonomic Group",
       fill = "Trophic Guild")

We can also have counts of each trophic guild within taxonomic group shown side-by-side by specifying position="dodge".

homerange %>% 
  ggplot(aes(x = taxon, fill = trophic.guild)) + geom_bar(position = "dodge") +
  coord_flip() +
  labs(title = "Observations by Taxon in Homerange Data",
       x = "Taxonomic Group",
       fill = "Trophic Guild")

Here is the same plot oriented vertically.

homerange %>% 
  ggplot(aes(x = taxon, fill = trophic.guild)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "Observations by Taxon in Homerange Data",
       x = "Taxonomic Group",
       fill= "Trophic Guild")

We can also scale all bars to a percentage (or proportion).

homerange %>% 
  ggplot(aes(x = taxon, fill = trophic.guild))+
  geom_bar(position = position_fill())+ 
  scale_y_continuous(labels = scales::percent)+
  coord_flip()

That’s it! Great work!